Abstract

We propose Neural Reasoner, a framework for neural network-based reasoning
over natural language sentences. Given a question, Neural Reasoner can infer over
multiple supporting facts and find an answer to the question in specific forms.
Neural Reasoner has 1) a specific interaction-pooling mechanism, allowing it to examine
multiple facts, and 2) a deep architecture, allowing it to model the complicated logical
relations in reasoning tasks. Assuming no particular structure exists in the question and
facts, Neural Reasoner is able to accommodate different types of reasoning and
different forms of language expressions. Despite the model complexity, Neural Reasoner
can still be trained effectively in an end-to-end manner. Our empirical studies show that
Neural Reasoner can outperform existing neural reasoning systems with remarkable
margins on two difficult artificial tasks (Positional Reasoning and Path Finding) proposed
in [5]. For example, it improves the accuracy on Path Finding (10K) from 33.4% [6] to over
98%.


1 Introduction 

Reasoning is essential to natural language processing tasks, most obviously in examples like 
document summarization, question-answering, and dialogue. Previous efforts in this direction 
are built on rule-based models, requiring first mapping natural languages to logic forms and 
then inference over them. The mapping (roughly corresponding to semantic parsing), and the 
inference, are by no means easy, given the variability and flexibility of natural language, the 
variety of the reasoning tasks, and the brittleness of a rule-based system. 

Recently, there has been some new effort, mainly represented by Memory Network and its
dynamic variants, trying to build a purely neural network-based reasoning system with
fully distributed semantics that can infer over multiple facts to answer simple questions, all
in natural language, e.g.,

Fact1: John travelled to the hallway.

Fact2: Mary journeyed to the bathroom. 

Question: Where is Mary? 

The Memory Nets perform fairly well on simple tasks like the examples above, but poorly
on more complicated ones due to their simple and rigid way of modeling the dynamics of
question-fact interaction and the complex process of reasoning.

*The work is done when the first author worked as intern at Noah’s Ark Lab, Huawei Technologies.

In this paper we give a more systematic treatment of the problem and propose a flexible
neural reasoning system, named Neural Reasoner. It is purely neural network based and
can be trained in an end-to-end way [6], using only supervision from the final answer. Our
contributions are mainly twofold:

• we propose a novel neural reasoning system, Neural Reasoner, that can infer over
multiple facts in a way insensitive to 1) the number of supporting facts, 2) the form of
language, and 3) the type of reasoning;

• we give a particular instantiation of Neural Reasoner and a multi-task training
method for effectively fitting the model with a relatively small amount of data, yielding
significantly better results than existing neural models on two artificial reasoning tasks.


2 Overview of Neural Reasoner 


Neural Reasoner has a layered architecture to deal with the complicated logical relations
in reasoning, as illustrated in Figure 1. It consists of one encoding layer and multiple reasoning
layers. The encoding layer first converts the question and facts from natural language sentences
to vectorial representations.

Figure 1: High level system diagram of Neural Reasoner.

We argue that Neural Reasoner has the following desired properties:

• it can handle a varying number of facts, including irrelevant ones, and reach the final
conclusion through repeated filtering and combining;

• it makes no assumption about the form of language, as long as enough training examples 
are given. 


3 Model 


In this section we give an instantiation of Neural Reasoner described in Section 2, as
illustrated in Figure 2. In a nutshell, question and facts, as symbol sequences, are first
converted to vectorial representations in the encoding layer via recurrent neural networks
(RNNs). The vectorial representations are then fed to the reasoning layers, where the question
and the facts get updated through a nonlinear transformation jointly controlled by deep
neural networks (DNNs) and pooling. Finally, at the answering layer, the resulting question
representation is used to generate the final answer to the question. More specifically:

• in the encoding layer (Layer-0) we use recurrent neural networks (RNNs) to convert
question and facts to their vectorial representations, which are then forwarded to the
first reasoning layer;

• in each reasoning layer (i.e., Layer-$\ell$ with $1 \le \ell \le L-1$), we use a deep neural network
(denoted as DNN$_\ell$) to model the pairwise interaction between the question representation
$q^{(\ell-1)}$ and each fact representation $f_k^{(\ell-1)}$ from the previous layer, which yields the updated
fact representation $f_k^{(\ell)}$ and the updated (fact-dependent) question representation $q_k^{(\ell)}$;
we then fuse the individual updated question representations $\{q_1^{(\ell)}, q_2^{(\ell)}, \cdots, q_K^{(\ell)}\}$ into the
global updated representation $q^{(\ell)}$ through a pooling operation (see Section 3.2 for
more details);

• finally, in Layer-$L$, the interaction net (DNN$_L$) returns only the question update, which, after
summarization by the pooling operation, will serve as input to the Answering Layer.


In the rest of this section, we will give details of different components of the model. 


3.1 Encoding Layer 

The encoding layer is designed to find semantic representations of question and facts. Suppose
that we are given a fact or a question as a word sequence $\{x_1, \cdots, x_T\}$; the encoding module
summarizes the word sequence with a vector of fixed length. We have different modeling
choices for this purpose, e.g., CNN [3] and RNN [3], while in this paper we use GRU [2], a
variant of RNN, as the encoding module. GRU has been shown to alleviate the vanishing
gradient issue of RNN and to perform similarly to the more complicated LSTM [3].

Figure 2: A diagram of our implementation of Neural Reasoner with L reasoning layers,
operating on one question and K facts.

As shown in Figure 3, GRU takes as input a sequence of word vectors (for either question
or facts)

    $X = \{x_1, \cdots, x_T\}, \quad x_t \in \mathbb{R}^{|V|}$    (1)

where $|V|$ stands for the size of vocabulary for input sentences. Detailed forward computations
are as follows:

    $z_t = \sigma(W_{xz} E x_t + W_{hz} h_{t-1})$    (2)

    $r_t = \sigma(W_{xr} E x_t + W_{hr} h_{t-1})$    (3)

    $\tilde{h}_t = \tanh(W_{xh} E x_t + U_{hh}(r_t \odot h_{t-1}))$    (4)

    $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$    (5)

where $E$ is the word embedding and $W_{xz}$, $W_{xr}$, $W_{xh}$, $W_{hz}$, $W_{hr}$, $U_{hh}$ are weight
matrices. We take the last hidden state $h_T$ as the representation of the word sequence.
Figure 3: The RNN encoder. The last state is used to summarize the word sequence
(example input: "the triangle is above the pink rectangle EOS").
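
The forward computation in Eqs. (2)-(5) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the initial state is taken to be zero, words are given as vocabulary indices (so that $E x_t$ is a column lookup for a one-hot $x_t$), and all sizes and the function name `gru_encode` are hypothetical rather than taken from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_encode(word_ids, E, W_xz, W_hz, W_xr, W_hr, W_xh, U_hh):
    """Return the last hidden state h_T for a sequence of word indices."""
    h = np.zeros(W_hz.shape[0])            # assumed initial state h_0 = 0
    for idx in word_ids:
        e = E[:, idx]                      # E x_t for a one-hot x_t
        z = sigmoid(W_xz @ e + W_hz @ h)   # update gate, Eq. (2)
        r = sigmoid(W_xr @ e + W_hr @ h)   # reset gate, Eq. (3)
        h_tilde = np.tanh(W_xh @ e + U_hh @ (r * h))  # candidate state, Eq. (4)
        h = (1.0 - z) * h + z * h_tilde    # interpolation, Eq. (5)
    return h

rng = np.random.default_rng(0)
V, m, d = 12, 5, 4                         # illustrative vocabulary, embedding, hidden sizes
E = rng.uniform(-0.1, 0.1, (m, V))
W_xz, W_xr, W_xh = (rng.uniform(-0.1, 0.1, (d, m)) for _ in range(3))
W_hz, W_hr, U_hh = (rng.uniform(-0.1, 0.1, (d, d)) for _ in range(3))
h_T = gru_encode([3, 1, 7, 0], E, W_xz, W_hz, W_xr, W_hr, W_xh, U_hh)
```

The same function would be applied to the question and to each fact, giving one fixed-length vector per sentence.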

3.2 Reasoning Layers 

The modules in the reasoning layers include those for question-fact interaction and pooling.

3.2.1 Question-Fact Interaction 

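On reasoning layer $\ell$, the $k$-th interaction is between $q^{(\ell-1)}$ and $f_k^{(\ell-1)}$, resulting in updated
representations $q_k^{(\ell)}$ and $f_k^{(\ell)}$:

    $[q_k^{(\ell)}, f_k^{(\ell)}] = g_{\mathrm{DNN}_\ell}([(q^{(\ell-1)})^\top, (f_k^{(\ell-1)})^\top]^\top; \Theta_\ell)$    (6)

with $\Theta_\ell$ being the parameters. In general, $q_k^{(\ell)}$ and $f_k^{(\ell)}$ can be of different dimensionality from
those of the previous layers. In the simplest case with a single layer in DNN$_\ell$, we have

    $q_k^{(\ell)} = \sigma(W_\ell^\top [(q^{(\ell-1)})^\top, (f_k^{(\ell-1)})^\top]^\top + b_\ell)$    (7)

where $\sigma(\cdot)$ stands for the nonlinear activation function.

Roughly speaking, $q_k^{(\ell)}$ contains the update of the system’s understanding on answering
the question after its interaction with fact $k$, while $f_k^{(\ell)}$ records the change of the $k$-th fact.
Therefore, $\{(q_k^{(\ell)}, f_k^{(\ell)})\}$ constitute the “state” of the reasoning process.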
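
As a rough illustration of the single-layer case in Eq. (7), the sketch below concatenates the question vector with each fact vector and applies one affine map followed by a nonlinearity. The use of tanh for the activation, the vector sizes, and the helper name `interact` are assumptions made for the example, not details given in the text.

```python
import numpy as np

def interact(q_prev, f_prev, W, b):
    """Single-layer interaction: sigma(W^T [q; f_k] + b), cf. Eq. (7)."""
    x = np.concatenate([q_prev, f_prev])   # stacked vector [q; f_k]
    return np.tanh(W.T @ x + b)            # tanh is an assumed choice of sigma

rng = np.random.default_rng(0)
d_q, d_f, d_out, K = 4, 4, 6, 3            # illustrative sizes
q_prev = rng.normal(size=d_q)              # question representation from layer l-1
facts = rng.normal(size=(K, d_f))          # K fact representations from layer l-1
W = rng.uniform(-0.1, 0.1, (d_q + d_f, d_out))
b = np.zeros(d_out)

# One fact-dependent question update per fact, shape (K, d_out)
q_updates = np.stack([interact(q_prev, f, W, b) for f in facts])
```

The same parameters are shared across all K facts, which is what lets the layer handle a varying number of facts.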

3.2.2 Pooling 

Pooling aims to fuse the understanding of the question right after its interaction with all the
facts to form the current status of the question, through which we can enable the comparison
between different facts. There are several strategies for this pooling:

• Average/Max Pooling: To obtain the $d$-th element in $q^{(\ell)}$, we can take the average
or the maximum of the elements at the same location from $\{q_1^{(\ell)}, \cdots, q_K^{(\ell)}\}$. For example,
with max-pooling, we have

    $q^{(\ell)}(d) = \max(\{q_1^{(\ell)}(d), q_2^{(\ell)}(d), \cdots, q_K^{(\ell)}(d)\}), \quad d = 1, 2, \cdots, D_\ell$

where $q^{(\ell)}(d)$ stands for the $d$-th element of vector $q^{(\ell)}$. Clearly this kind of pooling is
the simplest, without any associated parameters;

• Gating: We can have an extra gating network $g^{(\ell)}(\cdot)$ to determine the certainty of
the features in $q_k^{(\ell)}$ based on the input for getting $q_k^{(\ell)}$, namely $q^{(\ell-1)}$ and $f_k^{(\ell-1)}$. The output
$g^{(\ell)}(q^{(\ell-1)}, f_k^{(\ell-1)})$ has the same dimension as $q_k^{(\ell)}$, and its $d$-th element, after normalization,
can be used as the weight for the corresponding element in $q_k^{(\ell)}$ in obtaining $q^{(\ell)}$;

• Model-based: In the case of temporal reasoning, there is crucial information in the
sequential order of the facts. To account for this temporal structure, we can use a CNN
or RNN to combine the information in $\{q_1^{(\ell)}, \cdots, q_K^{(\ell)}\}$.

At Layer-$L$, the query representation $q^{(L)}$ after the pooling will serve as the features for the
final decision.
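
The parameter-free and gated strategies above can be sketched as follows, with the K fact-dependent question updates stacked as rows of a matrix. The gated variant normalises the gate scores across facts with a softmax, which is one possible reading of "after normalization" and should be treated as an assumption; the function names are hypothetical.

```python
import numpy as np

def max_pool(Q):
    """Element-wise maximum over the K rows of Q (shape (K, D))."""
    return Q.max(axis=0)

def avg_pool(Q):
    """Element-wise average over the K rows of Q."""
    return Q.mean(axis=0)

def gated_pool(Q, G):
    """Weighted sum of rows of Q; G (shape (K, D)) holds raw gate scores.

    Scores are normalised across the K facts with a softmax (an assumption).
    """
    G = G - G.max(axis=0, keepdims=True)   # shift for numerical stability
    w = np.exp(G)
    w /= w.sum(axis=0, keepdims=True)
    return (w * Q).sum(axis=0)

Q = np.array([[0.1, -0.5, 0.3],
              [0.4,  0.2, -0.1]])          # K = 2 updates of dimension D = 3
q_max = max_pool(Q)
q_avg = avg_pool(Q)
q_gate = gated_pool(Q, np.zeros_like(Q))   # uniform gates reduce to averaging
```

With identical gate scores the gated pooling coincides with average pooling, while very peaked scores select the update from a single fact.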

3.3 Answering Layer 

For simplicity, we focus on the reasoning tasks which can be formulated as classification with 
predetermined classes. More specifically, we apply Neural Reasoner to deal with the 
following two types of questions 

• Type I: General questions, i.e., questions with Yes-No answer; 

• Type II: Special questions with a small set of candidate answers. 

At reasoning Layer-$L$, the model performs pooling over the intermediate results to select important
information for further use:

    $q^{(L)} = \mathrm{pool}(\{q_1^{(L)}, q_2^{(L)}, \cdots, q_K^{(L)}\})$    (8)

    $y = \mathrm{softmax}(W_{\mathrm{softmax}}^\top q^{(L)})$    (9)

After reaching the last reasoning step (in this paper we take two steps), $q^{(L)}$ is sent to a standard
softmax layer to generate an answer, which is formulated as a classification problem.
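
A minimal sketch of Eqs. (8)-(9), assuming max-pooling as the pooling operator and a two-class (Yes/No) output; the sizes and the function name `answer` are illustrative only.

```python
import numpy as np

def softmax(a):
    a = a - a.max()                        # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

def answer(Q_L, W_softmax):
    """Pool the K question updates of the last layer, then classify."""
    q = Q_L.max(axis=0)                    # Eq. (8), max-pooling assumed
    return softmax(W_softmax.T @ q)        # Eq. (9)

rng = np.random.default_rng(0)
Q_L = rng.normal(size=(5, 8))              # K = 5 facts, D = 8 features
W_softmax = rng.uniform(-0.1, 0.1, (8, 2)) # two classes, e.g. Yes / No
y = answer(Q_L, W_softmax)
```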

There is another type of prediction as classification where the effective classes dynamically
change with instances, e.g., the Single-Supporting-Fact task in [9]. Those tasks cannot be
directly solved with Neural Reasoner. One simple way to circumvent this is to define the
following score function

    $\mathrm{score}_z = g_{\mathrm{match}}(q^{(L)}, w_z; \theta)$

where $g_{\mathrm{match}}$ is a function (e.g., a DNN) parameterized with $\theta$, and $w_z$ is the embedding for
class $z$, with $z$ being dynamically determined for the task.
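
To make the idea of a score over a dynamic candidate set concrete, the sketch below uses a bilinear form $q^\top M w_z$ as a stand-in for the matching function. The text only says the matching function could be, e.g., a DNN, so the bilinear choice, the matrix `M`, and the function name are assumptions for illustration.

```python
import numpy as np

def match_scores(q, candidates, M):
    """Score every candidate embedding w_z against q with q^T M w_z.

    candidates has shape (Z, d_w); the set of Z candidates may differ
    from instance to instance.
    """
    return candidates @ (M.T @ q)

q = np.array([1.0, 0.0])                   # pooled question representation
M = np.eye(2)                              # illustrative matching parameters
candidates = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [-1.0, 0.0]])       # embeddings of this instance's classes
scores = match_scores(q, candidates, M)
best = int(np.argmax(scores))              # index of the highest-scoring class
```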

3.4 Training 

The training of the model tunes the parameters in $\{\mathrm{RNN}_0, \mathrm{DNN}_1, \cdots, \mathrm{DNN}_L\}$ and those in the
softmax classifier. Similar to [6], we perform end-to-end training, taking the final answer as
the only supervision. More specifically, we use the cross entropy for the cost of classification

    $E_{\mathrm{reasoning}} = \sum_{n \in T} D_{CE}(p(y \mid r_n) \,\|\, y_n)$

where $n$ indexes the instances in the training set $T$, and $r_n = \{Q_n, F_{n,1}, \cdots, F_{n,K_n}\}$ stands
for the question and facts for the $n$-th instance.
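
For one-hot targets, the cross entropy above reduces to the negative log-probability of the correct class, summed over training instances. The following small sketch shows that computation; the function name and the toy probabilities are made up for illustration.

```python
import numpy as np

def reasoning_loss(probs, labels):
    """Sum over instances of -log p(correct answer)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]  # probability of each true class
    return -np.sum(np.log(picked))

probs = [[0.9, 0.1],                       # predicted distributions for two instances
         [0.2, 0.8]]
labels = [0, 1]                            # indices of the correct answers
loss = reasoning_loss(probs, labels)
```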

Our end-to-end training is the same as [6], while the training in [9] and [5] uses the step-
by-step labels on the supporting facts for each instance (see Table 1 for examples) in addition
to the answer. As described in [6], those extra labels bring much stronger supervision than just
the answer in the end-to-end learning setting, and typically yield significantly better results
on relatively complicated tasks.
4 Auxiliary Training for Question/Fact Representation


We use auxiliary training to facilitate the learning of representations of question and facts.
Basically, in addition to using the learned representations of question and facts in the reasoning
process, we also use those representations to reconstruct the original questions or their more
abstract forms with variables (elaborated later in Section 4.2).

In the auxiliary training, we intend to achieve the following two goals:


• to compensate for the lack of supervision in the learning task. In our experiments, the
supervision can be fairly weak since for each instance it is merely a classification with
no more than 12 classes, while the number of instances is 1K to 10K.


• to introduce beneficial bias for the representation learning task. Since the network is a
complicated nonlinear function, the back-propagation from the answering layer to the
encoding layer can easily fail to learn well.


Figure 4: Auxiliary training for question representation: (a) reconstructing the original
sentence (e.g., "The triangle is above the pink rectangle"); (b) producing an abstract form
with variables (e.g., "X is above Y"). The training for fact representation is identical and
therefore omitted.


4.1 Multi-task Learning Setting 

As illustrated in Figure 4, we take the simplest way to fuse the auxiliary tasks (recovering)
with the main task (reasoning) through linearly combining their costs with trade-off parameter
$\alpha$:

    $E = \alpha E_{\mathrm{recovering}} + (1 - \alpha) E_{\mathrm{reasoning}}$    (10)

where $E_{\mathrm{reasoning}}$ is the cross entropy loss describing the discrepancy of model prediction from
correct answer (see Section 3.4), and $E_{\mathrm{recovering}}$ is the negative log-likelihood of the sequences
(question or facts) to be recovered. More specifically,

    $E_{\mathrm{recovering}} = -\sum_{n \in T} \Big\{ \sum_{k=1}^{K_n} \log p(F_{n,k} \mid f_{n,k}^{(0)}) + \log p(Q_n \mid q_n^{(0)}) \Big\}$

where the likelihood is estimated as in the encoder-decoder framework proposed in [2]. On
top of the encoding layer (RNN), we add another decoding layer (RNN) which is trained to
sequentially predict words in the original sentence.
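
The linear combination in Eq. (10) is straightforward to write down. In the sketch below, the recovering cost of one sentence is computed from the probabilities a decoder assigns to each correct word; those probabilities, the function names, and the numbers are placeholders for illustration, not outputs of the actual model.

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of one sentence, given the decoder's
    probability for the correct word at each position."""
    return -sum(math.log(p) for p in step_probs)

def multitask_loss(e_recovering, e_reasoning, alpha):
    """E = alpha * E_recovering + (1 - alpha) * E_reasoning, cf. Eq. (10)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * e_recovering + (1.0 - alpha) * e_reasoning

e_rec = sequence_nll([0.5, 0.25])          # toy two-word sentence
e_total = multitask_loss(2.0, 4.0, alpha=0.25)
```

Setting alpha to 0 recovers pure end-to-end reasoning training, while larger values put more weight on the reconstruction task.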
4.2 Abstract Forms with Variables


Instead of recovering the original sentence in question and facts, we also study the effect of
producing a more abstract form in the auxiliary training task. More specifically, we let the
decoding RNN recover a sentence with entities replaced with variables (treated as particular
symbols), e.g.,

    The triangle is above the pink rectangle.  --recover-->  x is above y.

    The blue square is to the left of the triangle.  --recover-->  z is to the left of x.

    Is the pink rectangle to the right of the square?  --recover-->  Is y to the right of z?

Through this, we intend to teach the system a more abstract way of representing sentences
(both question and facts) and their interactions. More specifically,

• the entities are meaningful only when they are compared with each other.
In other words, the model (in the encoding and reasoning layers) should not consider
specific entities, but their general notions.

• it helps the model to focus on the relations between the entities, the commonality of
different facts, and the patterns shared between different instances.
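
One way to build such abstract targets is to replace each entity mention with a variable, sharing one mapping across all sentences of an instance. The sketch below does this with a regular expression; it assumes the entity names are known in advance and given in lowercase, and the function name `abstract_forms` and the variable alphabet are hypothetical, since the text does not describe how the targets were produced.

```python
import re

def abstract_forms(sentences, entities, variables="xyz"):
    """Replace entity mentions (with an optional leading 'the') by variables.

    Variables are assigned in order of first appearance and shared across
    all sentences, so the same entity always maps to the same symbol.
    """
    names = sorted(entities, key=len, reverse=True)   # prefer longer names
    pattern = re.compile(
        r"\b(?:the\s+)?(" + "|".join(map(re.escape, names)) + r")\b",
        re.IGNORECASE,
    )
    mapping = {}

    def to_variable(match):
        name = match.group(1).lower()
        if name not in mapping:
            mapping[name] = variables[len(mapping)]
        return mapping[name]

    return [pattern.sub(to_variable, s) for s in sentences], mapping

sentences = [
    "The triangle is above the pink rectangle.",
    "The blue square is to the left of the triangle.",
    "Is the pink rectangle to the right of the blue square?",
]
abstracted, mapping = abstract_forms(
    sentences, ["triangle", "pink rectangle", "blue square"]
)
```

Because the mapping is shared, the relation pattern between sentences is preserved while the concrete entity names are hidden.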


5 Experiments 

We report our empirical study on applying Neural Reasoner to the question answering tasks
defined in [8], and compare it against state-of-the-art neural models.


5.1 Setup 

bAbI is a synthetic question answering dataset. It contains 20 tasks, and each of them
is composed of a set of facts and a question, followed by an answer which is mostly a single
word. Most of the time, only a subset of the facts is relevant to the given question. Two
versions of the data are available: one has 1K training instances per task and the other has
10K instances per task, while the testing set is the same for the two versions.

We select the two most challenging tasks (among the 20 tasks in [8]), Positional Reasoning
and Path Finding, to test the reasoning ability of Neural Reasoner. The Positional Reasoning
task tests a model’s spatial reasoning ability, while the Path Finding task, first proposed in [1], tests
the ability to reason out the correct path between objects based on natural language instructions.
In Table 1 we give an instance of each task.


5.2 Implementation Details 

In our experiments, we actually used a simplified version of Neural Reasoner. In this
version

• we choose to keep the fact representations un-updated on each layer, e.g.,

    $f_k^{(0)} = f_k^{(1)} = \cdots = f_k^{(L-1)}, \quad k = 1, 2, \cdots, K$

This choice pushes the update $q_k^{(\ell)}$ (and its summarization $q^{(\ell)}$) to record all the
information in the interaction between facts and question.

• we use only two layers, i.e., L = 2, for the relatively simple tasks in the experiments.


Task I: path finding

1. The office is east of the hallway.
2. The kitchen is north of the office.
3. The garden is west of the bedroom.
4. The office is west of the garden.
5. The bathroom is north of the garden.

How do you go from the kitchen to the garden?    south, east (relies on 2 and 4)
How do you go from the office to the bathroom?   east, north (relies on 4 and 5)

Task II: positional reasoning

1. The triangle is above the pink rectangle.
2. The blue square is to the left of the triangle.

Is the pink rectangle to the right of the blue square?   Yes (relies on 1 and 2)
Is the blue square below the pink rectangle?             No (relies on 1 and 2)

Table 1: Samples of the two tasks: path finding (upper panel) and positional reasoning (lower
panel), with facts, questions and given answers (following each question). For each panel, we
first list the facts and then the questions that one needs to answer based on the given facts. On Task
I, the answer to the first question is south, east, standing for going south first and then east,
obtained based on facts 2 and 4.

Our model was trained with standard back-propagation (BP), aiming to maximize
the likelihood of correct answers. All the parameters, including the word embeddings, were
initialized by randomly sampling from a uniform distribution over [-0.1, 0.1]. Neither momentum nor
weight decay was used. We trained all the tasks for 200 epochs with stochastic gradient
descent; gradients with $\ell_2$ norm larger than 40 were clipped, and the learning rate was
controlled by AdaDelta [10]. For multi-task learning, different mixture ratios were tried, from
0.1 to 0.9.
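
Two of the details above, uniform initialization and norm-based gradient clipping, can be sketched as follows. The clipping shown rescales a gradient so that its $\ell_2$ norm equals the threshold, which is a common interpretation but an assumption here, since the text does not say how clipped gradients were adjusted; the helper names are hypothetical.

```python
import numpy as np

def init_uniform(shape, rng, scale=0.1):
    """Sample parameters uniformly from [-scale, scale]."""
    return rng.uniform(-scale, scale, size=shape)

def clip_by_l2_norm(grad, max_norm=40.0):
    """Rescale grad to norm max_norm if its l2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

rng = np.random.default_rng(0)
W = init_uniform((8, 4), rng)              # illustrative weight matrix
g = np.array([30.0, 40.0])                 # toy gradient with l2 norm 50
g_clipped = clip_by_l2_norm(g)             # rescaled to norm 40
```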

5.3 Neural Reasoner vs. Competitor Models 

We compare Neural Reasoner with the following three neural reasoning models: 1) Memory
Network, including the one with step-by-step supervision [9] (denoted as Memory Net-Step)
and the end-to-end version [6] (denoted as Memory Net-N2N), and 2) Dynamic Memory
Network, proposed in [5], also with step-by-step supervision. In Table 2 we report the
performance of a particular case of Neural Reasoner with 1) two reasoning layers, 2)
2-layer DNNs as the interaction modules in each reasoning layer, and 3) the auxiliary task of
recovering the original question and facts. The results are compared against the three neural
competitors. We have the following observations.

• The proposed Neural Reasoner performs significantly better than Memory Net-N2N,
especially with more training data.

• Although the comparison is not fair to our model, Neural Reasoner is actually better
than Memory Net-Step and Dynamic Memory Net on Positional Reasoning (1K)
& (10K) as well as Path Finding (10K), with about 20% margin over Memory Net-Step on
both tasks with 10K training instances.



Posi. Reason. (IK) 

Posi. Reason. (lOK) 

Step-by-step Supervision 



Memory Net-step 

65.0% 

75.4% 

Dynamic Memory Net 

59.6% 

- 

End-to-End 



Memory Net-N2N 

59.6% 

60.3% 

Neural Reasoner 

66.4% 

97.9% 



Path Finding (IK) 

Path Finding (lOK) 

Step-by-step Supervision 



Memory Net-step 

36.0% 

68.1% 

Dynamic Memory Net 

34.5% 

- 

End-to-End 



Memory Net-N2N 

17.2% 

33.4% 

Neural Reasoner 

17.3% 

87.0% 


Table 2: Results on two reasoning tasks. The results of Memory Net-step, Memory 
Net-N2N, and Dynamic Memory Net are taken respectively from [9],[6] and [5]. 

Please note that the results of Neural Reasoner reported in Table 2 are not based
on architectures specifically tuned for the tasks. As a matter of fact, with more complicated
models (more reasoning layers and deeper interaction modules), we can achieve even better
results on large datasets (e.g., over 98% accuracy on Path Finding with 10K instances). We
will however leave the discussion on different architectural variants to the next section.

5.4 Architectural Variations 

This section is devoted to the study of architectural variants of Neural Reasoner. More
specifically, we consider the variations in 1) the number of reasoning layers, 2) the depth of
the interaction DNN, and 3) the auxiliary tasks, with results summarized in Table 3. We
have the following observations:

• Auxiliary tasks are essential to the efficacy of Neural Reasoner, without which the
performance of Neural Reasoner drops dramatically. The reason, as we conjecture in
Section 4, is that the reasoning task alone cannot give enough supervision for learning
accurate word vectors and parameters of the RNN encoder. We note that Neural
Reasoner can still outperform Memory Net (N2N) with 10K data on both tasks.

• Neural Reasoner with shallow architectures, more specifically two reasoning layers
and 1-layer DNN, apparently can benefit from the auxiliary learning of recovering
abstract forms on small datasets (1K on both tasks). However, with deeper architectures
or more training data, the improvement over that of recovering original sentences
becomes smaller, despite the extra information it utilizes.

• When larger training datasets are available, Neural Reasoner appears to prefer
relatively deeper architectures. More importantly, although both tasks require two
reasoning steps, the performance does not deteriorate with three reasoning layers. On both
tasks, with 10K training instances, Neural Reasoner with three reasoning layers and
3-layer DNN can achieve over 98% accuracy.

                                      Posi. Reason. (1K)    Posi. Reason. (10K)
No auxiliary task
  2-layer reasoning, 1-layer DNN            60.2%                 72.1%
  2-layer reasoning, 2-layer DNN            59.6%                 69.3%
  3-layer reasoning, 3-layer DNN            58.7%                 59.7%
Auxiliary task: Original
  2-layer reasoning, 1-layer DNN            63.1%                 93.8%
  2-layer reasoning, 2-layer DNN            66.4%                 97.9%
  3-layer reasoning, 3-layer DNN            69.4%                 99.1%
Auxiliary task: Abstract
  2-layer reasoning, 1-layer DNN            70.9%                 95.2%
  2-layer reasoning, 2-layer DNN            66.6%                 95.6%
  3-layer reasoning, 3-layer DNN            68.3%                 97.4%

                                      Path Finding (1K)     Path Finding (10K)
No auxiliary task
  2-layer reasoning, 1-layer DNN            13.6%                 52.2%
  2-layer reasoning, 2-layer DNN            12.3%                 54.2%
  3-layer reasoning, 3-layer DNN            13.1%                 51.7%
Auxiliary task: Original
  2-layer reasoning, 1-layer DNN            14.1%                 57.0%
  2-layer reasoning, 2-layer DNN            17.3%                 87.0%
  3-layer reasoning, 3-layer DNN            14.0%                 98.4%
Auxiliary task: Abstract
  2-layer reasoning, 1-layer DNN            18.1%                 55.8%
  2-layer reasoning, 2-layer DNN            15.4%                 87.8%
  3-layer reasoning, 3-layer DNN            11.3%                 98.6%

Table 3: Results on two reasoning tasks yielded by Neural Reasoner with different
architectural variations.


6 Conclusion and Future Work 

We have proposed Neural Reasoner, a framework for neural network-based reasoning 
over natural language sentences. Neural Reasoner is flexible, powerful, and language
independent. Our empirical studies show that Neural Reasoner can dramatically improve
upon existing neural reasoning systems on two difficult artificial tasks proposed in [9]. For 
future work, we will explore 1) tasks with higher difficulty and reasoning depth, e.g., tasks 
which require a large number of supporting facts and facts with complex intrinsic structures, 
2) the common structure in different but similar reasoning tasks (e.g., multiple tasks all with 
general questions), and 3) automatic selection of the reasoning architecture, for example, 
determining when to stop the reasoning based on the data. 